We begin with SynMap output based on the Schnable 2011 paper (see my notes for details on parameter and download options).

Expression data was retrieved from NCBI: GSE50191 and was reported in Walley 2016. This set includes 69 experiments (3 biological replicates, 23 tissues) for 62,547 mRNAs aligned to Zea mays B73 RefGen_v2 5a WGS annotation set.

Protein abundance data was retrieved from Walley 2016 supplamental table S2. This dataset contains 3 to 7 biological replicates over 33 tissues. Expression data is based on a subset of this protein data tissues. Values are in dNASF, which is summarized here.

1 Import SynMap and GO Term Data


We parse the SynMap output file with ks/kn values (i.e. a DAGChainer file) to a flat format using a perl script (parseKsKnFile.pl) and import the data into R. Some ks and kn values are empty, which is to be expected. These are ignored for the purposes of finding the median, but included in the results if they are in a block that meets the cutoff criteria. GO terms related to Maize gene models are imported as well.

Lets take a look at what we just imported.


2 Choosing a KS cutoff value


We need to select a cutoff value for ks (synonymous mutation rate) to differentiate between the most recent alpha duplication event and the previous beta duplication. Looking at ks value frequency (left), we should not see any obvious cutoff point. By transforming ks values to block median values, we should see two (or more) peaks (center). In this (center) graph, the first peak represents orthologs and the second represents homeologs from “pregrass tetraploidy”. If the vertical line separates the first two peaks sufficently, it is probably a good enough value. If not, we need to pick a new cutoff value for ks. Viewing the log10 transform of the ks values helps differentiate types of homologs (better version of this graph with color can be generated in CoGe).

You have selected the log10(ks) cutoff value to be 0. Again, this should ideally separate two peaks in the middle graph.

Fig 2.1:

Fig 2.1:

Fig 2.1:

Fig 2.1:

Fig 2.1:

Fig 2.1:

3 Major Syntenic Blocks


Regions of synteny are not confined to a 1:1 relationship between Sorghum and Maize chromosomes. Where multiple regions in Maize are syntenic to the same region on Sorghum, we can assume this is a result of the whole genome duplication even. These are the regions which will be sorted into subgenome 1 and subgenome 2.

Table
Chromosome b34889_1 b34889_2 b34889_3 b34889_4 b34889_5 b34889_6 b34889_7 b34889_8 b34889_9 b34889_10 total
a31607_1 2431 na na na 710 na na na 736 na 3877
a31607_2 na 895 na na na na 1532 na na na 2427
a31607_3 na na 1914 na na na na 1174 na na 3088
a31607_4 na na na 893 1475 na na na na na 2368
a31607_5 na 114 na 296 na na na na na na 410
a31607_6 na 1055 na na na na na na na 747 1802
a31607_7 486 na na 412 na 108 na na na 164 1170
a31607_8 206 na 192 15 na 21 na na na 229 663
a31607_9 na na na na na 765 na 419 na 58 1242
a31607_10 na na na na 202 406 na na 700 na 1308

There are 0 unaligned sorghum chromosome(s) in the above table. If there are any unaligned sorghum chromosomes, consider picking a new ks cutoff.

4 SubGenomes


In order to separate subgenome 1 from subgenome 2, we implement a greedy algorithm to collect non-overlapping (when projected on sorghum) syntenic blocks by size. Psuedocode of the algorithm is as follows: foreach sorghum chromosome, set the largest syntenic chromosome (by gene) as sub1. Foreach additional syntenic chromosome if doesn’t overlap what is already in sub1, add it to sub1. Else, if it does not overlap anything in sub2, add to sub2. Else toss. After all syntenic chromosomes are process for this sorghum chromosome, count the number of genes in sub1 and sub2. If sub2 is bigger, switch them. *Toss out “small” syntenic chromosomes (51 genes or less?), and consider a tolerance for how much overlap must occur to consider it a true overlap. Also consider switching to gene level overlap rather than whole-chromosome start/stop values.

In order to see if one subgenome tends to have more isoforms, we plot the density of isoform counts from each subgenome based on isoforms in the v4 release 32 GFF file. We scale the x axis using log10, as we have a very long tail on this data (i.e. most counts are 1, but a few range to nearly 600). The densities track each other very strongly, so there isn’t likely anything to look for here.

5 Compared to Paper

How many chromosomes with synteny to sorghum were correctly placed in subgenome 1 or subgenome 2? Items in “Paper” only are missed subgenome assignments. Items in “Greedy” only are false assignments. Items in center are correct. (currently sorted by hand)

6 GO Terms in Maize


First we show both how many unique GO terms have been assigned using a computational ev-code vs. an experimental ev-code. A few terms have been annotated using both or were assigned multiple times based on different publications, which is why n is greater than the number of unique GO terms in our set. We also break down the GO terms by type. A few terms may not have a type, typically because they have been made obsolete.

Second, we look at the same divisions but this time for all gene annotations (i.e. gene + GO Term + Ev-Code). Again, since some terms are assigned computationally and experimentally to the same gene or assigned multiple times based on different publications, n is greater than the number of unique GO term assignments in out set. The main takeaways here are that computationally assigned GO terms vastly outnumber experimental annotations, most annotations use MF terms, and there are a few terms falling through the cracks most likely due to being outdated.

7 GO Term Comparison by Subgenome


GO Terms by subgenome. If the GO Term is assigned to any gene in the subgenome, it is counted once for that subgenome. This is a presence/absence test.

The main takeaways here are that 1) there are differences in functional assignments between each subgenome, regardless of which filters I apply. 2) the GOSlim for plants is extremely restrictive, and probably not useful. 3) Sub1 has more functional assignments than sub2, although this may be a direct correlation to the larger size (by definition) of sub1.


GO Term assignments to genes by subgenome, which takes into account the gene-GO Term pairings and considers homeologs equivalent for the purposes of determining overlap in assignments.

The main takeaways here mirror the assessment from the GO Term presence/absence check above. There are many differences in assignments, although we can assume that the same terms are reused very often based on the differenced between unique terms and gene annotations.

8 Expression


Cases where the maize1 homolog is more or less expressed than the maize2 homolog

Overall expression patterns between subgenome 1 and subgenome 2 are highly similar. Below we show to views of the same data. In the density graph, we can see that subgenome 1 is slightly shifted to the right of subgenome 2 expression, but otherwise follows the same shape. Viewed as a boxplot, you can again see that the quartiles in subgenome 1 and 2 are largely the same. This also shows a few high expressing outliers on subgenome 1.

Paired t-test suggests that there is no difference in means between FPKM of homeologs in sub1 vs. sub2 (p-val > 0.05, so we accept null hypothesis of equal means):

## 
##  Paired t-test
## 
## data:  expressedPairs$FPKM_mean1 and expressedPairs$FPKM_mean2
## t = 2.3561, df = 3006, p-value = 0.01853
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   8.214567 89.696047
## sample estimates:
## mean of the differences 
##                48.95531

I am curious if our definition of sub 1 and sub 2 (i.e. the larger syntenic block is defined as sub 1) is really the best way to decide which genes are in sub 1 and sub 2. It seems to me that if one could resonably expect that, given 2 identically duplicate genes, if there was no benefit to having duplicated function then it would be a matter of random chance which of the two duplicate genes would loose function over time. It also seems to me that the quickest way to “degrade” an unnecessary gene (evolutionarily speaking) is to reduce expression. So what happens if we pick the higher expressing gene to be in sub 1 and the lower expressing gene to be in sub 2? Not surprising, we can find differences in populations between high and low expressing pairs. I wonder if this holds across different maize lines?

9 Compare Homeolog Pairs


In all cases where there is a homeolog on both maize1 and maize2, the we want to look to several measures to see if the genes on one subgenome tend to be present in greater abundance compared to the other. We can define this comparison as either greater expression, greater protein abundance, or a greater number of isoforms available to the gene. We can compare using a factor to define a cutoff for “greater”. (i.e. 4-fold higher expression/abundance etc.).

First we look at expression values. We find that maize1 consistently overexpresses the maize2 pair on average across all available data, regardless of factor (1,2, or 4), and regardless of whether or not we allow NA values to be treated as ‘0’.

Fig 9.1:

Fig 9.1:


We can see this same trend for all experiments across all factors (fold-difference of 1 to 9 shown). The blue line (neither gene has greater expression) grows as factor increases since we have a stricter and stricter definition of what counts as greater expression. Red (maize1) is always above green (maize2) indicating that more maize1 homeologs have greater expression over their maize2 pair than vice versa for all expriments and all factors. Red and green lines both tend to zero when factor is large enough. Protein data (not shown) follows this same pattern.

Fig 9.3:

Fig 9.3:


The number of isoforms each gene of the homeolog pairs is thought to express follows a pattern where maize1 tends to have more isoforms than maize2 homeologs.

Fig 9.4:

Fig 9.4:

Looking at a few individual cases (* these are calculated as maize1 > maize2, since maize2 > maize1 makes a negative fold change):

Top ten differentially expressed gene pairs (in log2)
Maize1_url Maize2_url Sample Value_maize1 Value_maize2 foldChange
Zm00001d045431 Zm00001d036086 Endosperm_16DAP 38964.57 7.86 12.27535
Zm00001d021620 Zm00001d006475 Tip_Stage2_Leaf_V7 16318.20 3.50 12.18684
Zm00001d032535 Zm00001d013919 Endosperm_16DAP 13127.93 3.07 12.06211
Zm00001d032535 Zm00001d013919 Whole_Seed_24DAP 5431.91 1.36 11.96364
Zm00001d021620 Zm00001d006475 Tip_Stage2_ Leaf_V5 17530.37 5.02 11.76988
Zm00001d045431 Zm00001d036086 Endosperm_20DAP 9593.93 2.83 11.72710
Zm00001d021620 Zm00001d006475 Pooled_Leaves_V1 19171.87 6.40 11.54863
Zm00001d021620 Zm00001d006475 Leaf_12DAP 3648.25 1.27 11.48816
Zm00001d029124 Zm00001d047493 Endosperm_12DAP 5249.53 1.94 11.40192
Zm00001d017211 Zm00001d051093 Immature_Leaves_V9 3670.42 1.38 11.37706

Look at the top 10 pairs with highest isoform count diffs (foldChange in log2)

Top 10 pairs with highest isoform count diffs (foldChange in log2)
Maize1_url Maize2_url Value_maize1 Value_maize2 foldChange
Zm00001d044186 Zm00001d011386 315 5 5.977280
Zm00001d040748 Zm00001d009220 92 2 5.523562
Zm00001d028579 Zm00001d047831 34 2 4.087463
Zm00001d043900 Zm00001d011572 27 2 3.754888
Zm00001d040279 Zm00001d008338 23 2 3.523562
Zm00001d038190 Zm00001d010282 30 3 3.321928
Zm00001d027848 Zm00001d048318 18 2 3.169925
Zm00001d019087 Zm00001d007770 18 2 3.169925
Zm00001d040193 Zm00001d008398 86 10 3.104337
Zm00001d017798 Zm00001d051590 46 6 2.938600

Here is a scatter plot of isoform counts across gene pairs.

Fig 9.5:

Fig 9.5:

Fig 9.5:

Fig 9.5:

12 Wrapup


Report generated on: December 01, 2017